Draft analysis¶


Group name: D


Introduction¶

This section introduces the project motivation, the data, and the research question, and provides a data dictionary.

Setup¶

In [ ]:
import pandas as pd
import altair as alt
import numpy as np
from pandas import DataFrame
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

alt.data_transformers.disable_max_rows()  # from the histogram code overview
from scipy import stats # to compute the mode 

from sklearn.linear_model import LinearRegression #Fitting a line
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso

import matplotlib.pyplot as plt  # To visualize

import joblib
import time

Definition for left-aligned display¶

In [ ]:
def left_align(df: DataFrame):
    left_aligned_df = df.style.set_properties(**{'text-align': 'left'})
    left_aligned_df = left_aligned_df.set_table_styles(
        [dict(selector='th', props=[('text-align', 'left')])]
    )
    return left_aligned_df

Data¶

Import data¶

In [ ]:
df_bevoelkerung = pd.read_csv(
    '../references/csv_Bevoelkerung/Zensus11_Datensatz_Bevoelkerung.csv',
    delimiter=';',
    dtype={
        'AGS_12': 'category',
        'RS_Land': 'category',
        'RS_RB_NUTS2': 'category',
        'RS_Kreis': 'category',
        'RS_VB': 'category',
        'RS_Gem': 'category',
        'Name': 'category',
        'Reg_Hier': 'category'
    }
)
/var/folders/k5/1ngg2lrs0p51m_z5s1q72vv00000gn/T/ipykernel_54101/3067287871.py:1: DtypeWarning: Columns (106,113,125,126,129,131,134,142,143,156,157,158,165,166,169,176,177,181,182,185,186,190,192,202,207,221) have mixed types. Specify dtype option on import or set low_memory=False.
  df_bevoelkerung = pd.read_csv(

Data structure¶

In [ ]:
df_bevoelkerung.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12544 entries, 0 to 12543
Columns: 223 entries, AGS_12 to BIL_5.8
dtypes: category(8), float64(41), int64(8), object(166)
memory usage: 21.4+ MB

Data corrections¶

Perform datatype corrections so that afterwards only category or float dtypes remain:

  • convert integers to float
  • convert '/' and '-' into 0 values, since strictly speaking these count as 0
  • convert numbers in brackets into plain numbers
In [ ]:
# convert integers to float
for column in df_bevoelkerung.select_dtypes(['int64']):
    df_bevoelkerung[column] = df_bevoelkerung[column].astype('float64')

df_bevoelkerung = df_bevoelkerung.replace(['/', '-'], 0)

for column in df_bevoelkerung.select_dtypes('object'):
    df_bevoelkerung[column] = df_bevoelkerung[column].astype(str).str.extract(r'(\d+)').astype('float64')
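The cleaning steps above can be illustrated on a toy column (hypothetical values, not from the census file): suppressed entries (`'/'`, `'-'`) become 0, and `str.extract` pulls the digits out of bracketed numbers like `'(71)'`.

```python
import pandas as pd

# Toy column mimicking the raw census formatting (hypothetical values):
# plain numbers, suppressed entries ('/' and '-'), and bracketed numbers like '(71)'.
raw = pd.Series(['123', '/', '-', '(71)'])

clean = raw.replace(['/', '-'], 0)  # suppressed entries count as 0
clean = clean.astype(str).str.extract(r'(\d+)')[0].astype('float64')

print(clean.tolist())  # [123.0, 0.0, 0.0, 71.0]
```

Note that `r'(\d+)'` keeps only the first run of digits, so decimal points or signs would be lost; the same caveat applies to the cell above.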
In [ ]:
df_bevoelkerung.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12544 entries, 0 to 12543
Columns: 223 entries, AGS_12 to BIL_5.8
dtypes: category(8), float64(215)
memory usage: 21.4 MB

Check whether the conversion of the numbers was successful:

In [ ]:
df_bevoelkerung.loc[df_bevoelkerung['Name'] == 'Barkenholm', ['DEM_2.7']]
Out[ ]:
DEM_2.7
43 71.0
In [ ]:
df_bevoelkerung.loc[df_bevoelkerung['Name']=='Bergewöhrden', ['DEM_2.10']]
Out[ ]:
DEM_2.10
44 0.0

Variable lists¶

In [ ]:
df_predictor_variables = pd.read_excel('../references/Predictor Variables Definition.xlsx', index_col= 0)
In [ ]:
left_align(df_predictor_variables)
Out[ ]:
  Quote Berechnung
Variable    
Migrationshintergrund Migrationsquote (Anzahl Personen mit Migrationshintergrund / Anzahl Personen insgesamt)
Religionszugehörigkeit Christenquote* (Römisch-katholische Kirche + Evangelische Kirch) / Bevölkerung nach Religion gesamt
Geschlecht Männerquote (Anzahl Männer / Einwohner gesamt)
Bildungsniveau Akademikerquote** (Fach- oder Berufsakademie + FH-Abschluss + Hochschulabschluss + Promotion) / höchster beruflicher Abschluss insgesamt
Stellung im Beruf Beamtenquote (Anzahl Beamter / Erwerbstätige insgesamt)
Familienstand Singlequote*** (Anzahl Lediger + Verwitwete + Geschiedene + eingetragene Lebenspartnerschaft aufgehoben + Eingetragener Lebenspartner/- in verstorben + ohne Angaben) / Familienstand gesamt

Computation of the variables across the entire dataset¶

In [ ]:
df_bevoelkerung['Arbeitslosenquote'] = df_bevoelkerung['ERW_1.10'] / df_bevoelkerung['ERW_1.4']*100
df_bevoelkerung['Arbeitslosenquote2'] = (1-(df_bevoelkerung['ERW_1.7'] / df_bevoelkerung['ERW_1.4']))*100
df_bevoelkerung['Migrationsquote'] = df_bevoelkerung['MIG_1.3'] / df_bevoelkerung['MIG_1.1']*100
df_bevoelkerung['Migrationsquote2'] = (1-(df_bevoelkerung['MIG_1.2'] / df_bevoelkerung['MIG_1.1']))*100
df_bevoelkerung['Christenquote'] = ((df_bevoelkerung['REL_1.2'] + df_bevoelkerung['REL_1.3']) / df_bevoelkerung['REL_1.1'])*100
df_bevoelkerung['Männerquote'] = (df_bevoelkerung['DEM_1.2']  / df_bevoelkerung['DEM_1.1'])*100
df_bevoelkerung['Akademikerquote'] = ((df_bevoelkerung['BIL_5.5'] + df_bevoelkerung['BIL_5.6'] + df_bevoelkerung['BIL_5.7'] + df_bevoelkerung['BIL_5.8'])  / df_bevoelkerung['BIL_5.1'])*100
df_bevoelkerung['Beamtenquote'] = (df_bevoelkerung['ERW_2.3'] / df_bevoelkerung['ERW_2.1'])*100
df_bevoelkerung['Singlequote'] = ((df_bevoelkerung['DEM_2.4'] + df_bevoelkerung['DEM_2.10'] + df_bevoelkerung['DEM_2.13'] + df_bevoelkerung['DEM_2.19'] + df_bevoelkerung['DEM_2.22'] + df_bevoelkerung['DEM_2.25']) / df_bevoelkerung['DEM_2.1'])*100
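All quotas above follow the same part / total * 100 pattern. One caveat, shown here on hypothetical numbers rather than the census data: since '/' and '-' were replaced by 0, a total of 0 produces inf or NaN, which the later dropna step then removes.

```python
import numpy as np
import pandas as pd

# Hypothetical part/total pairs; a total of 0 arises from suppressed values.
toy = pd.DataFrame({'part': [10.0, 5.0, 0.0], 'total': [200.0, 0.0, 0.0]})

# part / total * 100, with division by a zero total mapped to NaN
toy['quote'] = (toy['part'] / toy['total'] * 100).replace([np.inf, -np.inf], np.nan)

print(toy['quote'].tolist())  # [5.0, nan, nan]
```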
In [ ]:
df_bevoelkerung.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12544 entries, 0 to 12543
Columns: 232 entries, AGS_12 to Singlequote
dtypes: category(8), float64(224)
memory usage: 22.3 MB
In [ ]:
df_bevoelkerung.head()
Out[ ]:
AGS_12 RS_Land RS_RB_NUTS2 RS_Kreis RS_VB RS_Gem Name Reg_Hier AEWZ DEM_1.1 ... BIL_5.8 Arbeitslosenquote Arbeitslosenquote2 Migrationsquote Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote
0 0 0 NaN NaN NaN NaN Deutschland Bund 80219695.0 80219695.0 ... 908970.0 4.652478 4.652501 19.205282 19.205282 59.132542 48.798417 16.541543 5.080131 54.203097
1 1 1 NaN NaN NaN NaN Schleswig-Holstein Land 2800119.0 2800119.0 ... 25380.0 4.578416 4.578416 12.024768 12.024768 57.512020 48.586649 14.198910 6.676007 54.187161
2 10010000000 1 0 1 0 0 Flensburg, Stadt Gemeinde 82258.0 82258.0 ... 0.0 6.657547 6.657547 15.957447 15.957447 56.027377 49.276666 13.355639 8.378114 62.147147
3 1001 1 0 1 NaN NaN Flensburg, Stadt Stadtkreis/kreisfreie Stadt/Landkreis 82258.0 82258.0 ... 0.0 6.657547 6.657547 15.957447 15.957447 56.027377 49.276666 13.355639 8.378114 62.147147
4 10020000000 1 0 2 0 0 Kiel, Landeshauptstadt Gemeinde 235782.0 235782.0 ... 4100.0 7.539341 7.539341 18.900021 18.900021 48.656386 48.139807 17.758138 7.578323 64.350968

5 rows × 232 columns

Data splitting¶

Trim the dataframe to the relevant columns and filter to municipalities and federal states¶

Filter the dataframe to the relevant columns and copy them into a new one:

In [ ]:
df_analyse = df_bevoelkerung.iloc[:, [6,7,223,224,225,226,227,228,229,230,231]].copy()

Remove NaN values:

In [ ]:
df_analyse.dropna(inplace=True)
In [ ]:
df_analyse
Out[ ]:
Name Reg_Hier Arbeitslosenquote Arbeitslosenquote2 Migrationsquote Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote
0 Deutschland Bund 4.652478 4.652501 19.205282 19.205282 59.132542 48.798417 16.541543 5.080131 54.203097
1 Schleswig-Holstein Land 4.578416 4.578416 12.024768 12.024768 57.512020 48.586649 14.198910 6.676007 54.187161
2 Flensburg, Stadt Gemeinde 6.657547 6.657547 15.957447 15.957447 56.027377 49.276666 13.355639 8.378114 62.147147
3 Flensburg, Stadt Stadtkreis/kreisfreie Stadt/Landkreis 6.657547 6.657547 15.957447 15.957447 56.027377 49.276666 13.355639 8.378114 62.147147
4 Kiel, Landeshauptstadt Gemeinde 7.539341 7.539341 18.900021 18.900021 48.656386 48.139807 17.758138 7.578323 64.350968
... ... ... ... ... ... ... ... ... ... ... ...
12492 Zeulenroda-Triebes, Stadt Gemeinde 5.662651 5.662651 3.547963 3.613666 28.863238 48.219960 12.058824 4.342273 51.721678
12497 Altenburger Land Stadtkreis/kreisfreie Stadt/Landkreis 7.879628 7.879628 1.858003 1.868383 17.894911 48.803457 12.459584 3.460133 50.574237
12498 Altenburg, Stadt Gemeinde 9.632751 9.632751 1.978736 1.978736 13.745967 47.926078 13.866232 3.530979 53.276621
12500 Meuselwitz, Stadt Gemeinde 9.363958 9.363958 2.098540 2.098540 11.446552 48.403867 8.841463 3.118908 49.872309
12502 Schmölln, Stadt Gemeinde 6.430868 6.430868 3.710095 3.710095 19.060231 48.560444 10.587103 3.608247 51.456642

2186 rows × 11 columns

List of predictors:

In [ ]:
predictor = df_analyse.iloc[:,5:11].columns.values.tolist()
predictor
Out[ ]:
['Migrationsquote2',
 'Christenquote',
 'Männerquote',
 'Akademikerquote',
 'Beamtenquote',
 'Singlequote']

Filter the dataframe to the hierarchy level Gemeinde (municipality).

In [ ]:
df_analyse_gemeinde = df_analyse[df_analyse['Reg_Hier']=='Gemeinde'].reset_index(drop=True)
In [ ]:
df_analyse_gemeinde
Out[ ]:
Name Reg_Hier Arbeitslosenquote Arbeitslosenquote2 Migrationsquote Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote
0 Flensburg, Stadt Gemeinde 6.657547 6.657547 15.957447 15.957447 56.027377 49.276666 13.355639 8.378114 62.147147
1 Kiel, Landeshauptstadt Gemeinde 7.539341 7.539341 18.900021 18.900021 48.656386 48.139807 17.758138 7.578323 64.350968
2 Lübeck, Hansestadt Gemeinde 7.158110 7.167394 16.812500 16.812500 56.723806 47.470103 13.992802 6.540654 59.120801
3 Neumünster, Stadt Gemeinde 6.924644 6.899185 16.924489 16.924489 56.149594 48.816166 8.385235 6.015860 56.965139
4 Brunsbüttel, Stadt Gemeinde 5.365854 5.365854 13.682565 13.682565 62.607137 49.579243 6.877828 4.123711 51.277856
... ... ... ... ... ... ... ... ... ... ... ...
1568 Greiz, Stadt Gemeinde 6.813820 6.813820 2.112338 2.112338 25.183037 47.736997 13.745338 3.089598 53.023676
1569 Zeulenroda-Triebes, Stadt Gemeinde 5.662651 5.662651 3.547963 3.613666 28.863238 48.219960 12.058824 4.342273 51.721678
1570 Altenburg, Stadt Gemeinde 9.632751 9.632751 1.978736 1.978736 13.745967 47.926078 13.866232 3.530979 53.276621
1571 Meuselwitz, Stadt Gemeinde 9.363958 9.363958 2.098540 2.098540 11.446552 48.403867 8.841463 3.118908 49.872309
1572 Schmölln, Stadt Gemeinde 6.430868 6.430868 3.710095 3.710095 19.060231 48.560444 10.587103 3.608247 51.456642

1573 rows × 11 columns

In [ ]:
df_analyse_gemeinde.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1573 entries, 0 to 1572
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Name                1573 non-null   category
 1   Reg_Hier            1573 non-null   category
 2   Arbeitslosenquote   1573 non-null   float64 
 3   Arbeitslosenquote2  1573 non-null   float64 
 4   Migrationsquote     1573 non-null   float64 
 5   Migrationsquote2    1573 non-null   float64 
 6   Christenquote       1573 non-null   float64 
 7   Männerquote         1573 non-null   float64 
 8   Akademikerquote     1573 non-null   float64 
 9   Beamtenquote        1573 non-null   float64 
 10  Singlequote         1573 non-null   float64 
dtypes: category(2), float64(9)
memory usage: 465.3 KB
In [ ]:
# define outcome variable as y_label
y_label = 'Arbeitslosenquote2'

# select features
features = predictor

# create feature data
X = df_analyse_gemeinde[features]

# create response
y = df_analyse_gemeinde[y_label]

Data Splitting - train & test data¶

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=42)
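As a sanity check on the split (a minimal sketch with synthetic data, not the census data): `test_size=0.2` reserves 20% of the rows for the test set, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 synthetic rows
y_demo = np.arange(10)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
Xtr2, Xte2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(len(Xtr), len(Xte))   # 8 2
print((Xte == Xte2).all())  # True: same seed, same split
```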
In [ ]:
# data exploration set
df_train = pd.DataFrame(X_train.copy())
df_train = df_train.join(pd.DataFrame(y_train))
In [ ]:
df_train
Out[ ]:
Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote Arbeitslosenquote2
1041 29.206142 73.971140 48.100815 18.275467 5.090006 53.898667 2.363636
277 18.195158 80.491887 49.356036 10.834050 4.965517 52.041039 2.945114
1223 33.358298 67.257531 47.849582 16.443128 3.807391 55.075046 3.563715
925 21.221751 75.438658 48.500299 16.352459 5.246523 50.564366 1.678061
1161 25.235602 85.163236 49.448950 10.391198 5.534351 50.207943 2.420857
... ... ... ... ... ... ... ...
1130 19.279854 60.266185 46.379427 30.991957 5.922747 53.891270 3.158770
1294 12.534626 81.905846 49.614947 12.569170 6.327373 51.597134 4.216074
860 19.919110 65.085772 48.678103 16.607774 4.291045 50.232089 3.423423
1459 4.362730 14.812994 48.360429 13.388544 2.231237 52.158948 10.526316
1126 27.406765 67.395069 49.459883 18.365288 5.927052 59.627278 2.806499

1258 rows × 7 columns

Filter the dataframe to the federal-state level:

In [ ]:
df_analyse_bund = df_analyse[df_analyse['Reg_Hier']=='Land'].reset_index(drop=True)
In [ ]:
df_analyse_bund
Out[ ]:
Name Reg_Hier Arbeitslosenquote Arbeitslosenquote2 Migrationsquote Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote
0 Schleswig-Holstein Land 4.578416 4.578416 12.024768 12.024768 57.512020 48.586649 14.198910 6.676007 54.187161
1 Hamburg Land 5.664054 5.664054 28.301597 28.301597 39.577054 48.365438 23.635275 4.521244 61.686205
2 Niedersachsen Land 4.401018 4.401018 16.725965 16.726094 65.962115 48.904344 13.676174 5.798117 53.289024
3 Bremen Land 6.646302 6.646302 26.452131 26.452131 51.082332 48.566595 18.274408 5.146615 60.012015
4 Nordrhein-Westfalen Land 5.095187 5.095187 24.451495 24.451495 67.909493 48.567748 15.288151 5.245268 53.513457
5 Hessen Land 3.883143 3.883143 25.473128 25.473128 63.148731 48.793566 18.289708 4.978489 53.313280
6 Rheinland-Pfalz Land 3.787513 3.787048 19.088275 19.088023 74.492607 48.883355 13.877359 6.134863 52.208653
7 Baden-Württemberg Land 3.134949 3.134949 25.678058 25.678058 69.060549 48.943658 16.920449 4.981700 53.377529
8 Bayern Land 2.853099 2.853099 19.116721 19.116721 74.162141 48.902160 16.963287 5.053229 53.829801
9 Saarland Land 4.393987 4.395949 16.344238 16.343231 81.104376 48.523293 12.916943 5.696280 52.839520
10 Berlin Land 8.555266 8.555266 24.069973 24.069973 28.133363 48.586745 25.783922 5.450796 63.691085
11 Brandenburg Land 6.416525 6.417262 4.564780 4.564780 20.280685 49.203390 17.598278 5.558664 53.727695
12 Mecklenburg-Vorpommern Land 7.664427 7.664427 3.812100 3.812100 20.501422 49.263935 15.712481 4.245318 54.828659
13 Sachsen Land 6.528669 6.528669 4.388315 4.388315 23.819864 48.746980 17.744963 2.578026 54.621710
14 Sachsen-Anhalt Land 7.835750 7.835750 3.755953 3.755512 16.981950 48.841122 14.975257 3.601325 53.821271
15 Thüringen Land 5.669116 5.669116 3.531474 3.531474 30.951540 49.167477 16.028120 4.039632 53.412724

Migration variables:

In [ ]:
variables_migration = ['Reg_Hier','Name','ERW_1.4','ERW_1.10','MIG_1.1','MIG_1.2','MIG_1.3','MIG_1.4','MIG_1.5','MIG_1.6','MIG_1.7','MIG_1.8','MIG_1.9','MIG_1.10','MIG_1.11','MIG_2.1','MIG_2.2','MIG_2.3','MIG_2.4','MIG_2.5','MIG_2.6','MIG_2.7','MIG_2.8','MIG_3.1','MIG_3.2','MIG_3.3','MIG_3.4','MIG_3.5']

Religion variables:

In [ ]:
variables_religion = ['Reg_Hier','Name','ERW_1.4','ERW_1.10','REL_1.1','REL_1.2','REL_1.3','REL_1.4']

Gender variables:

In [ ]:
variables_geschlecht= ['Reg_Hier','Name','ERW_1.4','ERW_1.10','DEM_1.1','DEM_1.2','DEM_1.3']

Education variables:

In [ ]:
variables_bildung= ['Reg_Hier','Name','ERW_1.4','ERW_1.10','BIL_5.1','BIL_5.2','BIL_5.3','BIL_5.4','BIL_5.5','BIL_5.6','BIL_5.7','BIL_5.8']

Occupation variables:

In [ ]:
variables_beruf = ['Reg_Hier','Name','ERW_1.4','ERW_1.10','ERW_2.1','ERW_2.2','ERW_2.3','ERW_2.4','ERW_2.5','ERW_2.6']

Family variables:

In [ ]:
variables_familien = ['Reg_Hier','Name','ERW_1.4','ERW_1.10','DEM_2.1','DEM_2.4','DEM_2.7','DEM_2.10','DEM_2.13','DEM_2.16','DEM_2.19','DEM_2.22','DEM_2.25']

Analysis¶

Descriptive statistics¶

In [ ]:
df_analyse_gemeinde.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Arbeitslosenquote 1573.0 4.129315 2.237428 0.000000 2.794760 3.618907 5.016766 16.871705
Arbeitslosenquote2 1573.0 4.225062 2.100031 0.780031 2.796174 3.619303 5.022500 16.871705
Migrationsquote 1573.0 17.750686 9.632732 0.000000 10.633649 17.838900 24.120116 53.932014
Migrationsquote2 1573.0 17.752142 9.628915 0.850662 10.663616 17.836257 24.120116 53.982750
Christenquote 1573.0 62.137232 20.945943 5.933338 58.604711 68.010543 75.942976 93.909239
Männerquote 1573.0 48.697872 0.843340 45.102669 48.190799 48.702359 49.177376 54.993659
Akademikerquote 1573.0 13.770178 5.853955 2.092871 9.776536 12.344777 16.242999 47.997457
Beamtenquote 1573.0 5.067968 1.740783 1.236476 3.889789 4.892966 6.017192 18.870728
Singlequote 1573.0 52.214032 3.013940 44.397914 50.252657 51.690254 53.661406 66.462950
In [ ]:
df_analyse_gemeinde_long = df_analyse_gemeinde.iloc[:,2:11].melt(var_name="Quotenname",value_name="Quote")
df_analyse_gemeinde_long
Out[ ]:
Quotenname Quote
0 Arbeitslosenquote 6.657547
1 Arbeitslosenquote 7.539341
2 Arbeitslosenquote 7.158110
3 Arbeitslosenquote 6.924644
4 Arbeitslosenquote 5.365854
... ... ...
14152 Singlequote 53.023676
14153 Singlequote 51.721678
14154 Singlequote 53.276621
14155 Singlequote 49.872309
14156 Singlequote 51.456642

14157 rows × 2 columns

In [ ]:
alt.Chart(df_analyse_gemeinde_long).mark_area(
    opacity=0.5,
    interpolate='step'
).encode(
    alt.X('Quote:Q', bin=alt.Bin(maxbins=50)),
    alt.Y('count()', stack=None),
    alt.Color('Quotenname:N'),
    tooltip = ['Quotenname']
).properties(
    title="Zusammenfassendes Histogramm der Quoten"
).interactive()
Out[ ]:
In [ ]:
Quoten = df_analyse_gemeinde.iloc[:,2:12].columns.values.tolist()
In [ ]:
alt.Chart(df_analyse_gemeinde, width=200, height=150).mark_bar().encode(
    alt.X(alt.repeat("repeat"), type="quantitative", bin=True),
    y='count()'
).repeat(
    repeat=Quoten,
    columns=3
)
Out[ ]:
In [ ]:
source = df_analyse_gemeinde
hist = alt.Chart(source).mark_bar().encode(
    x=alt.X("Arbeitslosenquote2", 
            bin=True),
    y='count()',
)
In [ ]:
# Boxplot 
box = alt.Chart(source).mark_boxplot().encode(
    x='Arbeitslosenquote2',
)
In [ ]:
alt.vconcat(hist, box)
Out[ ]:
In [ ]:
df_analyse_gemeinde["Arbeitslosenquote2"].describe()
Out[ ]:
count    1573.000000
mean        4.225062
std         2.100031
min         0.780031
25%         2.796174
50%         3.619303
75%         5.022500
max        16.871705
Name: Arbeitslosenquote2, dtype: float64
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Arbeitslosenquote2",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Arbeitslosenquote2',
        scale=alt.Scale(zero=True)
    )
)

print(source['Arbeitslosenquote2'].describe())
alt.vconcat(hist, box,).properties(title='Übersicht Arbeitslosenquote').configure_title(fontSize = 20)
count    1573.000000
mean        4.225062
std         2.100031
min         0.780031
25%         2.796174
50%         3.619303
75%         5.022500
max        16.871705
Name: Arbeitslosenquote2, dtype: float64
Out[ ]:
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Migrationsquote",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Migrationsquote',
        scale=alt.Scale(zero=True)
    )
)

print(source['Migrationsquote'].describe())
Migrationsquote = alt.vconcat(hist, box,).properties(title='Übersicht Migrationsquote').configure_title(fontSize = 20)
Migrationsquote
count    1573.000000
mean       17.750686
std         9.632732
min         0.000000
25%        10.633649
50%        17.838900
75%        24.120116
max        53.932014
Name: Migrationsquote, dtype: float64
Out[ ]:
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Migrationsquote2",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Migrationsquote2',
        scale=alt.Scale(zero=True)
    )
)

print(source['Migrationsquote2'].describe())
alt.vconcat(hist, box,).properties(title='Übersicht Migrationsquote2').configure_title(fontSize = 20)
count    1573.000000
mean       17.752142
std         9.628915
min         0.850662
25%        10.663616
50%        17.836257
75%        24.120116
max        53.982750
Name: Migrationsquote2, dtype: float64
Out[ ]:
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Christenquote",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Christenquote',
        scale=alt.Scale(zero=True)
    )
)

print(source['Christenquote'].describe())
alt.vconcat(hist, box,).properties(title='Übersicht Christenquote').configure_title(fontSize = 20)
count    1573.000000
mean       62.137232
std        20.945943
min         5.933338
25%        58.604711
50%        68.010543
75%        75.942976
max        93.909239
Name: Christenquote, dtype: float64
Out[ ]:
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Männerquote",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Männerquote',
        scale=alt.Scale(zero=True)
    )
)

print(source['Männerquote'].describe())
alt.vconcat(hist, box,).properties(title='Übersicht Männerquote').configure_title(fontSize = 20)
count    1573.000000
mean       48.697872
std         0.843340
min        45.102669
25%        48.190799
50%        48.702359
75%        49.177376
max        54.993659
Name: Männerquote, dtype: float64
Out[ ]:
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Akademikerquote",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Akademikerquote',
        scale=alt.Scale(zero=True)
    )
)

print(source['Akademikerquote'].describe())
alt.vconcat(hist, box,).properties(title='Übersicht Akademikerquote').configure_title(fontSize = 20)
count    1573.000000
mean       13.770178
std         5.853955
min         2.092871
25%         9.776536
50%        12.344777
75%        16.242999
max        47.997457
Name: Akademikerquote, dtype: float64
Out[ ]:
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Beamtenquote",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Beamtenquote',
        scale=alt.Scale(zero=True)
    )
)

print(source['Beamtenquote'].describe())
alt.vconcat(hist, box,).properties(title='Übersicht Beamtenquote').configure_title(fontSize = 20)
count    1573.000000
mean        5.067968
std         1.740783
min         1.236476
25%         3.889789
50%         4.892966
75%         6.017192
max        18.870728
Name: Beamtenquote, dtype: float64
Out[ ]:
In [ ]:
hist = alt.Chart(source).mark_bar().encode(
    alt.X(
        "Singlequote",
        bin=True,
        scale=alt.Scale(zero=True)
    ),
    alt.Y('count()')
)
box = alt.Chart(source).mark_boxplot().encode(
    x=alt.X(
        'Singlequote',
        scale=alt.Scale(zero=True)
    )
)

print(source['Singlequote'].describe())
alt.vconcat(hist, box,).properties(title='Übersicht Singlequote').configure_title(fontSize = 20)
count    1573.000000
mean       52.214032
std         3.013940
min        44.397914
25%        50.252657
50%        51.690254
75%        53.661406
max        66.462950
Name: Singlequote, dtype: float64
Out[ ]:

Exploratory data analysis¶

Relationships¶

In [ ]:
alt.Chart(source, width=200, height=150).mark_circle(size=60).encode(
    alt.X(
        alt.repeat("repeat"), 
        type="quantitative",
        scale=alt.Scale(zero=False)),
    alt.Y('Arbeitslosenquote2'),
    tooltip = ['Name',alt.Tooltip(alt.repeat("repeat"), type="quantitative"), alt.Y('Arbeitslosenquote2')]
).repeat(
    repeat=predictor,
    columns=4
).interactive()
Out[ ]:
In [ ]:
alt.Chart(source).mark_circle(size=60).encode(
    x=alt.X('Migrationsquote2'),
    y=alt.Y('Arbeitslosenquote2', 
            title='ALO_Quote'),
    tooltip=['Migrationsquote2', 'Arbeitslosenquote2','Name']
).interactive()
Out[ ]:
In [ ]:
corr_data = source[['Migrationsquote2','Arbeitslosenquote2']]
corr = corr_data.corr(method='pearson').round(5)
corr
Out[ ]:
Migrationsquote2 Arbeitslosenquote2
Migrationsquote2 1.00000 -0.27309
Arbeitslosenquote2 -0.27309 1.00000
In [ ]:
corr_blues = corr.style.background_gradient(cmap='Blues')
corr_blues
Out[ ]:
  Migrationsquote2 Arbeitslosenquote2
Migrationsquote2 1.000000 -0.273090
Arbeitslosenquote2 -0.273090 1.000000
In [ ]:
corr_list = corr['Arbeitslosenquote2'].sort_values(ascending=False)
corr_list
Out[ ]:
Arbeitslosenquote2    1.00000
Migrationsquote2     -0.27309
Name: Arbeitslosenquote2, dtype: float64
In [ ]:
# inspect correlation between outcome and possible predictors
corr = df_train.corr(method = 'pearson').round(5)
corr[y_label].sort_values(ascending=False)
Out[ ]:
Arbeitslosenquote2    1.00000
Singlequote           0.44441
Akademikerquote      -0.10260
Männerquote          -0.23303
Beamtenquote         -0.25672
Migrationsquote2     -0.29545
Christenquote        -0.66274
Name: Arbeitslosenquote2, dtype: float64
In [ ]:
# take a look at all correlations
corr.style.background_gradient(cmap='Blues')
Out[ ]:
  Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote Arbeitslosenquote2
Migrationsquote2 1.000000 0.430310 -0.049660 0.093810 -0.014510 0.063220 -0.295450
Christenquote 0.430310 1.000000 0.153050 -0.253270 0.286860 -0.315040 -0.662740
Männerquote -0.049660 0.153050 1.000000 -0.303060 -0.091700 -0.274020 -0.233030
Akademikerquote 0.093810 -0.253270 -0.303060 1.000000 0.273150 0.239260 -0.102600
Beamtenquote -0.014510 0.286860 -0.091700 0.273150 1.000000 -0.039460 -0.256720
Singlequote 0.063220 -0.315040 -0.274020 0.239260 -0.039460 1.000000 0.444410
Arbeitslosenquote2 -0.295450 -0.662740 -0.233030 -0.102600 -0.256720 0.444410 1.000000
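Pearson correlation, as used above, captures only linear association. For skewed quotas a rank-based Spearman correlation can be an instructive cross-check. A sketch with hypothetical data: for a monotone but nonlinear relation, Spearman is exactly 1 while Pearson falls below 1.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                        'y': [1.0, 4.0, 9.0, 16.0]})  # y = x**2: monotone, nonlinear

pearson = df_demo['x'].corr(df_demo['y'], method='pearson')
spearman = df_demo['x'].corr(df_demo['y'], method='spearman')

print(round(pearson, 3), round(spearman, 3))  # 0.984 1.0
```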

Model¶

Select model¶

In [ ]:
data = df_analyse_gemeinde.dropna()[['Name','Migrationsquote2', 'Arbeitslosenquote2']]
data
Out[ ]:
Name Migrationsquote2 Arbeitslosenquote2
0 Flensburg, Stadt 15.957447 6.657547
1 Kiel, Landeshauptstadt 18.900021 7.539341
2 Lübeck, Hansestadt 16.812500 7.167394
3 Neumünster, Stadt 16.924489 6.899185
4 Brunsbüttel, Stadt 13.682565 5.365854
... ... ... ...
1568 Greiz, Stadt 2.112338 6.813820
1569 Zeulenroda-Triebes, Stadt 3.613666 5.662651
1570 Altenburg, Stadt 1.978736 9.632751
1571 Meuselwitz, Stadt 2.098540 9.363958
1572 Schmölln, Stadt 3.710095 6.430868

1573 rows × 3 columns

In [ ]:
data.describe()
Out[ ]:
Migrationsquote2 Arbeitslosenquote2
count 1573.000000 1573.000000
mean 17.752142 4.225062
std 9.628915 2.100031
min 0.850662 0.780031
25% 10.663616 2.796174
50% 17.836257 3.619303
75% 24.120116 5.022500
max 53.982750 16.871705
In [ ]:
y_label = "Arbeitslosenquote2"

X_test2 = data[["Migrationsquote2"]]
y_test2 = data[y_label]
In [ ]:
# Choose the linear regression model
reg_test = LinearRegression()
In [ ]:
# Fit the model to the data
reg_test.fit(X_test2, y_test2)
Out[ ]:
LinearRegression()
In [ ]:
print(f' Intercept: {reg_test.intercept_:.4} \n Slope: {reg_test.coef_[0]:.3}')
 Intercept: 5.282 
 Slope: -0.0596
In [ ]:
# Intercept
reg_test.intercept_
Out[ ]:
5.282378184546983
In [ ]:
# Slope
reg_test.coef_
Out[ ]:
array([-0.0595599])
In [ ]:
# Make predictions on the data
y_pred = reg_test.predict(X_test2)
y_pred
Out[ ]:
array([4.33195426, 4.15669481, 4.28102737, ..., 5.16452487, 5.15738934,
       5.0614053 ])
In [ ]:
mean_squared_error(y_test2, y_pred)
Out[ ]:
4.078635290942068
In [ ]:
mean_squared_error(y_test2, y_pred, squared=False)
Out[ ]:
2.0195631435887487
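`squared=False` returns the root of the MSE; equivalently, one can take `np.sqrt` of the MSE, which avoids the parameter (the `squared` argument has been deprecated in newer scikit-learn releases in favour of a dedicated RMSE function). A sketch with hypothetical values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_hat)  # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                      # same value as squared=False gives

print(round(mse, 4), round(rmse, 4))     # 1.6667 1.291
```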
In [ ]:
x_Scatter =  data['Migrationsquote2']
y_Scatter = data['Arbeitslosenquote2']

X_test2 = data[["Migrationsquote2"]]
y_pred = y_pred

plt.scatter(x_Scatter, y_Scatter)
plt.plot(X_test2, y_pred, color='red')
plt.show()
In [ ]:
reg_mig = LinearRegression()
reg_chr = LinearRegression()
reg_sin = LinearRegression()
reg_multi=LinearRegression()

Select model: Lasso¶

In [ ]:
X_train_lasso = X_train.copy()
X_test_lasso = X_test.copy()
scaler = StandardScaler().fit(X_train[features]) 

X_train_lasso[features] = scaler.transform(X_train_lasso[features])
X_test_lasso[features] = scaler.transform(X_test_lasso[features])
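Fitting the scaler on `X_train` only and reusing it for `X_test`, as above, avoids leaking test-set statistics into training. A minimal sketch of what `StandardScaler` does, on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

scaler_demo = StandardScaler().fit(X_demo)
Z = scaler_demo.transform(X_demo)

# Each column now has mean 0 and (population) standard deviation 1.
print(Z.mean(axis=0).round(10), Z.std(axis=0).round(10))
```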
In [ ]:
X_train
Out[ ]:
Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote
1041 29.206142 73.971140 48.100815 18.275467 5.090006 53.898667
277 18.195158 80.491887 49.356036 10.834050 4.965517 52.041039
1223 33.358298 67.257531 47.849582 16.443128 3.807391 55.075046
925 21.221751 75.438658 48.500299 16.352459 5.246523 50.564366
1161 25.235602 85.163236 49.448950 10.391198 5.534351 50.207943
... ... ... ... ... ... ...
1130 19.279854 60.266185 46.379427 30.991957 5.922747 53.891270
1294 12.534626 81.905846 49.614947 12.569170 6.327373 51.597134
860 19.919110 65.085772 48.678103 16.607774 4.291045 50.232089
1459 4.362730 14.812994 48.360429 13.388544 2.231237 52.158948
1126 27.406765 67.395069 49.459883 18.365288 5.927052 59.627278

1258 rows × 6 columns

In [ ]:
# select the lasso model with built-in cross-validation
reg = LassoCV(cv=5, random_state=0)

Training and validation¶

In [ ]:
# cross-validation with 5 folds
scores_mig = cross_val_score(reg_mig, X_train[['Migrationsquote2']], y_train, cv=5, scoring='neg_mean_squared_error') *-1
scores_chr = cross_val_score(reg_chr, X_train[['Christenquote']], y_train, cv=5, scoring='neg_mean_squared_error') *-1
scores_sin = cross_val_score(reg_sin, X_train[['Singlequote']], y_train, cv=5, scoring='neg_mean_squared_error') *-1
# cross-validation with 5 folds for the multiple regression
scores = cross_val_score(reg_multi, X_train, y_train, cv=5, scoring='neg_mean_squared_error') *-1
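scikit-learn scorers follow a "greater is better" convention, which is why the scoring string is `neg_mean_squared_error`; multiplying by -1, as above, recovers ordinary non-negative MSE values. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

raw = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                      scoring='neg_mean_squared_error')
mse_scores = raw * -1  # flip sign to get plain MSE per fold

print((raw <= 0).all(), (mse_scores >= 0).all())  # True True
```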
In [ ]:
# store cross-validation scores: Migrationsquote, Christenquote and Singlequote
df_scores_mig = pd.DataFrame({"lr": scores_mig})
df_scores_chr = pd.DataFrame({"lr": scores_chr})
df_scores_sin = pd.DataFrame({"lr": scores_sin})

# reset index to match the number of folds
df_scores_mig.index += 1
df_scores_chr.index += 1
df_scores_sin.index += 1

# print dataframe

df_scores_mig.style.background_gradient(cmap='Blues')
Out[ ]:
  lr
1 3.914431
2 3.731555
3 3.467859
4 4.145899
5 4.440537
In [ ]:
#Christenquote
df_scores_chr.style.background_gradient(cmap='Blues')
Out[ ]:
  lr
1 2.060155
2 2.423749
3 2.352862
4 2.447104
5 2.804205
In [ ]:
#Singlequote
df_scores_sin.style.background_gradient(cmap='Blues')
Out[ ]:
  lr
1 3.449375
2 3.092362
3 2.863533
4 3.594474
5 4.297317
In [ ]:
# store cross-validation scores Multiple Regression
df_scores = pd.DataFrame({"lr": scores})

# reset index to match the number of folds Multiple Regression
df_scores.index += 1

# print dataframe
df_scores.style.background_gradient(cmap='Blues')
Out[ ]:
  lr
1 1.443932
2 1.526375
3 1.385411
4 1.527816
5 2.038018

Chart of the folds for Migrationsquote

In [ ]:
alt.Chart(df_scores_mig.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
Out[ ]:

Chart of the folds for Christenquote

In [ ]:
alt.Chart(df_scores_chr.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
Out[ ]:

Chart Folds Singlequote

In [ ]:
alt.Chart(df_scores_sin.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
Out[ ]:

Chart Folds Multiple Regression

In [ ]:
alt.Chart(df_scores.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)
Out[ ]:
In [ ]:
df_scores_mig.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
lr 5.0 3.940056 0.37415 3.467859 3.731555 3.914431 4.145899 4.440537
In [ ]:
df_scores_chr.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
lr 5.0 2.417615 0.265673 2.060155 2.352862 2.423749 2.447104 2.804205
In [ ]:
df_scores_sin.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
lr 5.0 3.459412 0.550051 2.863533 3.092362 3.449375 3.594474 4.297317
In [ ]:
df_scores.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
lr 5.0 1.584311 0.260608 1.385411 1.443932 1.526375 1.527816 2.038018
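The four describe() tables above can be condensed into a single comparison; the sketch below hardcodes the mean CV MSE values taken from those outputs.

```python
import pandas as pd

# mean 5-fold CV MSE per candidate model (values copied from the describe() outputs)
cv_summary = pd.DataFrame({
    "model": ["Migrationsquote", "Christenquote", "Singlequote", "Multiple"],
    "mean_cv_mse": [3.940056, 2.417615, 3.459412, 1.584311],
}).sort_values("mean_cv_mse").reset_index(drop=True)
```

Sorted this way, the multiple regression sits on top with the lowest cross-validated MSE.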

Training & Best Alpha Lasso Regression¶

In [ ]:
reg.fit(X_train_lasso, y_train)
Out[ ]:
LassoCV(cv=5, random_state=0)
In [ ]:
reg.alpha_
Out[ ]:
0.0063775417634839085
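For intuition, the next sketch shows on synthetic data how LassoCV arrives at alpha_: it evaluates a grid of alphas by cross-validation and keeps the one with the lowest mean CV error (illustrative data only; X_train_lasso is not reused here).

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

# LassoCV fits the regularization path over a grid of alphas (alphas_)
# and stores the CV-optimal one in alpha_
reg_cv = LassoCV(cv=5, random_state=0).fit(X, y)
best_alpha = reg_cv.alpha_
```

The chosen alpha_ is always one of the grid values in alphas_, which is the value reused below when refitting a plain Lasso.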

Fit model¶

In [ ]:
# Fit the model to the complete training data
reg_mig.fit(X_train[['Migrationsquote2']], y_train)
reg_chr.fit(X_train[['Christenquote']], y_train)
reg_sin.fit(X_train[['Singlequote']], y_train)
Out[ ]:
LinearRegression()
In [ ]:
# Fit the model to the complete training data
reg_multi.fit(X_train, y_train)
Out[ ]:
LinearRegression()

Migrationsquote

In [ ]:
# intercept
intercept = pd.DataFrame({
    "Name": ["Intercept"],
    "Coefficient":[reg_mig.intercept_]}
    )

# make a slope table
slope = pd.DataFrame({
    "Name": 'slope',
    "Coefficient": reg_mig.coef_}
)

# combine estimates of intercept and slope
table = pd.concat([intercept, slope], ignore_index=True, sort=False)

round(table, 3)
Out[ ]:
Name Coefficient
0 Intercept 5.342
1 slope -0.064
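Using the rounded coefficients from the table, a prediction can be checked by hand (differences to reg_mig.predict stem purely from rounding):

```python
# fitted line from the table above: unemployment ≈ 5.342 - 0.064 * Migrationsquote2
intercept_hat, slope_hat = 5.342, -0.064

def predict_mig(x):
    """Predict Arbeitslosenquote2 from Migrationsquote2 using the rounded coefficients."""
    return intercept_hat + slope_hat * x

# e.g. a municipality with a migration share of 15 %
approx = predict_mig(15)  # 5.342 - 0.064 * 15 = 4.382
```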

Christenquote

In [ ]:
# intercept
intercept = pd.DataFrame({
    "Name": ["Intercept"],
    "Coefficient":[reg_chr.intercept_]}
    )

# make a slope table
slope = pd.DataFrame({
    "Name": 'slope',
    "Coefficient": reg_chr.coef_}
)

# combine estimates of intercept and slope
table = pd.concat([intercept, slope], ignore_index=True, sort=False)

round(table, 3)
Out[ ]:
Name Coefficient
0 Intercept 8.241
1 slope -0.065

Singlequote

In [ ]:
# intercept
intercept = pd.DataFrame({
    "Name": ["Intercept"],
    "Coefficient":[reg_sin.intercept_]}
    )

# make a slope table
slope = pd.DataFrame({
    "Name": 'slope',
    "Coefficient": reg_sin.coef_}
)

# combine estimates of intercept and slope
table = pd.concat([intercept, slope], ignore_index=True, sort=False)

round(table, 3)
Out[ ]:
Name Coefficient
0 Intercept -12.168
1 slope 0.314

Multiple Regression

In [ ]:
# intercept
intercept = pd.DataFrame({
    "Name": ["Intercept"],
    "Coefficient":[reg_multi.intercept_]}
    )

# make a slope table
slope = pd.DataFrame({
    "Name": features,
    "Coefficient": reg_multi.coef_}
)

# combine estimates of intercept and slopes
table = pd.concat([intercept, slope], ignore_index=True, sort=False)

round(table, 3)
Out[ ]:
Name Coefficient
0 Intercept 20.239
1 Migrationsquote2 -0.000
2 Christenquote -0.064
3 Männerquote -0.428
4 Akademikerquote -0.147
5 Beamtenquote 0.044
6 Singlequote 0.203

Fit Model Lasso Regression¶

In [ ]:
# Fit the model to the complete training data
reg = Lasso(alpha=reg.alpha_)
reg.fit(X_train_lasso, y_train)
Out[ ]:
Lasso(alpha=0.0063775417634839085)
In [ ]:
# intercept
intercept = pd.DataFrame({
    "Name": ["Intercept"],
    "Coefficient":[reg.intercept_]}
    )

# make a slope table
slope = pd.DataFrame({
    "Name": features,
    "Coefficient": reg.coef_}
)

# combine estimates of intercept and slopes
table = pd.concat([intercept, slope], ignore_index=True, sort=False)

round(table, 3)
Out[ ]:
Name Coefficient
0 Intercept 4.215
1 Migrationsquote2 -0.003
2 Christenquote -1.350
3 Männerquote -0.349
4 Akademikerquote -0.812
5 Beamtenquote 0.062
6 Singlequote 0.591
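Note: if X_train_lasso was standardized with StandardScaler (an assumption suggested by the imports, not confirmed here), these coefficients measure the effect of a one-standard-deviation change and are not directly comparable to the unscaled multiple-regression coefficients above. A standardized coefficient can be mapped back to the original unit as sketched below; the standard deviation of 20 in the example is hypothetical.

```python
def unstandardize_coef(coef_std, feature_std):
    """Coefficient per original unit = standardized coefficient / feature std."""
    return coef_std / feature_std

# hypothetical example: Christenquote coefficient -1.350 with an assumed
# feature standard deviation of 20 percentage points
coef_orig = unstandardize_coef(-1.350, 20.0)  # -0.0675 per percentage point
```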

Evaluation on test set¶

In [ ]:
# obtain predictions
y_pred_mig = reg_mig.predict(X_test[['Migrationsquote2']])
y_pred_chr = reg_chr.predict(X_test[['Christenquote']])
y_pred_sin = reg_sin.predict(X_test[['Singlequote']])
y_pred_multi = reg_multi.predict(X_test)
In [ ]:
# predictions for three hypothetical migration shares; the column must carry
# the feature name the model was fitted with
test = pd.DataFrame(
    {"Migrationsquote2": [15, 30, 20]}
)
test
Out[ ]:
Migrationsquote2
0 15
1 30
2 20
In [ ]:
reg_mig.predict(test)
Out[ ]:
array([4.3892813 , 3.43630654, 4.07162304])
In [ ]:
X_test[['Migrationsquote2']]
Out[ ]:
Migrationsquote2
1120 18.461538
810 15.162791
1339 3.671189
534 18.630933
514 14.435390
... ...
1263 21.283255
1281 10.962963
1209 12.860013
1007 22.473868
1404 5.213904

315 rows × 1 columns

In [ ]:
# R squared Migrationsquote
r2_score(y_test, y_pred_mig).round(3)
Out[ ]:
0.027
In [ ]:
# R squared Christenquote
r2_score(y_test, y_pred_chr).round(3)
Out[ ]:
0.382
In [ ]:
# R squared Singlequote
r2_score(y_test, y_pred_sin).round(3)
Out[ ]:
0.121
In [ ]:
# R squared Multiple Regression
r2_score(y_test, y_pred_multi).round(3)
Out[ ]:
0.636
In [ ]:
# adjusted R squared Migrationsquote (cf. Chapter 8 of the book)
print((1-(1-r2_score(y_test, y_pred_mig))*((len(X_test[['Migrationsquote2']])-1)/(len(X_test[['Migrationsquote2']])-len(X_test[['Migrationsquote2']].columns)-1))).round(3))
0.024
In [ ]:
# adjusted R squared Christenquote (cf. Chapter 8 of the book)
print((1-(1-r2_score(y_test, y_pred_chr))*((len(X_test[['Christenquote']])-1)/(len(X_test[['Christenquote']])-len(X_test[['Christenquote']].columns)-1))).round(3))
0.38
In [ ]:
# adjusted R squared Singlequote (cf. Chapter 8 of the book)
print((1-(1-r2_score(y_test, y_pred_sin))*((len(X_test[['Singlequote']])-1)/(len(X_test[['Singlequote']])-len(X_test[['Singlequote']].columns)-1))).round(3))
0.118
In [ ]:
# adjusted R squared Multiple Regression (cf. Chapter 8 of the book)
print((1-(1-r2_score(y_test, y_pred_multi))*((len(X_test)-1)/(len(X_test)-len(X_test.columns)-1))).round(3))
0.629
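The adjusted-R² expression repeated in the four cells above can be wrapped in a small helper; plugging in the multiple-regression numbers reproduces the printed 0.629.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R squared: 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# multiple regression: R² = 0.636 on 315 test rows with 6 predictors
adj_multi = round(adjusted_r2(0.636, 315, 6), 3)  # 0.629, as printed above
```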
In [ ]:
# MSE Migrationsquote
mean_squared_error(y_test, y_pred_mig).round(3)
Out[ ]:
4.708
In [ ]:
# MSE Christenquote
mean_squared_error(y_test, y_pred_chr).round(3)
Out[ ]:
2.991
In [ ]:
# MSE Singlequote
mean_squared_error(y_test, y_pred_sin).round(3)
Out[ ]:
4.254
In [ ]:
# MSE Multiple Regression
mean_squared_error(y_test, y_pred_multi).round(3)
Out[ ]:
1.762
In [ ]:
# RMSE Migrationsquote
mean_squared_error(y_test, y_pred_mig, squared=False).round(3)
Out[ ]:
2.17
In [ ]:
# RMSE Christenquote
mean_squared_error(y_test, y_pred_chr, squared=False).round(3)
Out[ ]:
1.729
In [ ]:
# RMSE Singlequote
mean_squared_error(y_test, y_pred_sin, squared=False).round(3)
Out[ ]:
2.063
In [ ]:
# RMSE Multiple Regression
mean_squared_error(y_test, y_pred_multi, squared=False).round(3)
Out[ ]:
1.327
In [ ]:
# MAE Migrationsquote
mean_absolute_error(y_test, y_pred_mig).round(3)
Out[ ]:
1.686
In [ ]:
# MAE Christenquote
mean_absolute_error(y_test, y_pred_chr).round(3)
Out[ ]:
1.296
In [ ]:
# MAE Singlequote
mean_absolute_error(y_test, y_pred_sin).round(3)
Out[ ]:
1.492
In [ ]:
# MAE Multiple Regression
mean_absolute_error(y_test, y_pred_multi).round(3)
Out[ ]:
1.004
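The per-metric cells above could also be bundled into one reporting helper, a convenience sketch that is not part of the original notebook:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Collect R², MSE, RMSE and MAE for one model in a single dict."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": round(r2_score(y_true, y_pred), 3),
        "mse": round(mse, 3),
        "rmse": round(np.sqrt(mse), 3),
        "mae": round(mean_absolute_error(y_true, y_pred), 3),
    }

# tiny self-contained check on made-up values
report = regression_report([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```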

Evaluation on test set Lasso Regression¶

In [ ]:
# obtain predictions
y_pred_lasso = reg.predict(X_test_lasso)
In [ ]:
# R squared
r2_score(y_test, y_pred_lasso).round(3)
Out[ ]:
0.635
In [ ]:
# adjusted R squared (cf. Chapter 8 of the book)
print((1-(1-r2_score(y_test, y_pred_lasso))*((len(X_test_lasso)-1)/(len(X_test_lasso)-len(X_test_lasso.columns)-1))).round(3))
0.628
In [ ]:
# MSE
mean_squared_error(y_test, y_pred_lasso).round(3)
Out[ ]:
1.767
In [ ]:
# RMSE
mean_squared_error(y_test, y_pred_lasso, squared=False).round(3)
Out[ ]:
1.329
In [ ]:
# MAE
mean_absolute_error(y_test, y_pred_lasso).round(3)
Out[ ]:
1.004

Feature Importance Multiple Regression¶

In [ ]:
importance = np.abs(reg_multi.coef_)

df_imp = pd.DataFrame({"coeff": importance, 
                       "name": features}).round(3)
df_imp
Out[ ]:
coeff name
0 0.000 Migrationsquote2
1 0.064 Christenquote
2 0.428 Männerquote
3 0.147 Akademikerquote
4 0.044 Beamtenquote
5 0.203 Singlequote
In [ ]:
alt.Chart(df_imp).mark_bar().encode(
    x="coeff",
    y=alt.Y("name", sort='-x')
)
Out[ ]:
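Since a raw |coefficient| depends on the feature's unit, permutation importance is a common scale-free alternative; the sketch below demonstrates it on synthetic data (the notebook's X_test/y_test are not reused).

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# feature 0 matters a lot, feature 1 a little, feature 2 not at all
y = 3 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.2, size=200)

model = LinearRegression().fit(X, y)
# importances_mean: average drop in score when each feature is shuffled
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```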

Feature Importance Lasso¶

In [ ]:
importance = np.abs(reg.coef_)

df_imp = pd.DataFrame({"coeff": importance, 
                       "name": features}).round(3)
df_imp
Out[ ]:
coeff name
0 0.003 Migrationsquote2
1 1.350 Christenquote
2 0.349 Männerquote
3 0.812 Akademikerquote
4 0.062 Beamtenquote
5 0.591 Singlequote
In [ ]:
alt.Chart(df_imp).mark_bar().encode(
    x="coeff",
    y=alt.Y("name", sort='-x')
)
Out[ ]:

Save model¶

Save your model in the folder models/. Use a meaningful name and a timestamp.

In [ ]:
TIME = "-" + time.strftime("%Y%m%d-%H%M")
PATH = "../models/"
FILE_CHR = "reg_model_linreg_christenquote"
FILE_MUL = "reg_model_multiplereg"
FILE_LAS = "reg_model_lassoreg"
FORMAT = ".pkl"
In [ ]:
joblib.dump(reg_chr, PATH + FILE_CHR + TIME + FORMAT)
joblib.dump(reg_multi, PATH + FILE_MUL + TIME + FORMAT)
joblib.dump(reg, PATH + FILE_LAS + TIME + FORMAT)
Out[ ]:
['../models/reg_model_lassoreg-20221228-2256.pkl']
In [ ]:
final_model_linreg = joblib.load(PATH + FILE_CHR + TIME + FORMAT)
final_model_multireg = joblib.load(PATH + FILE_MUL + TIME + FORMAT)

# pretend this is new data (3 observations)
new_data_linreg = X[['Migrationsquote2']].iloc[:3]
new_data =  X.iloc[:3]

# make predictions for the three observations
predictions_linreg = final_model_linreg.predict(new_data_linreg)
predictions_multireg = final_model_multireg.predict(new_data)
In [ ]:
predictions_linreg
Out[ ]:
array([7.20551964, 7.01455173, 7.15002819])
In [ ]:
predictions_multireg
Out[ ]:
array([6.56113549, 7.28797002, 6.50030877])

Conclusions¶

To build an understanding of the data, we first describe the cleaned data set that serves as input for the models.

In [ ]:
alt.Chart(df_analyse_gemeinde, width=200, height=150).mark_bar().encode(
    alt.X(alt.repeat("repeat"), type="quantitative", bin=True),
    y='count()'
).repeat(
    repeat=Quoten,
    columns=3
)
Out[ ]:

The histogram of "Christenquote" shows a left-skewed, multimodal distribution. "Männerquote" is approximately symmetric and unimodal. All remaining variables are right-skewed and unimodally distributed.

In [ ]:
# take a look at all correlations
corr.style.background_gradient(cmap='Blues')
Out[ ]:
  Migrationsquote2 Christenquote Männerquote Akademikerquote Beamtenquote Singlequote Arbeitslosenquote2
Migrationsquote2 1.000000 0.430310 -0.049660 0.093810 -0.014510 0.063220 -0.295450
Christenquote 0.430310 1.000000 0.153050 -0.253270 0.286860 -0.315040 -0.662740
Männerquote -0.049660 0.153050 1.000000 -0.303060 -0.091700 -0.274020 -0.233030
Akademikerquote 0.093810 -0.253270 -0.303060 1.000000 0.273150 0.239260 -0.102600
Beamtenquote -0.014510 0.286860 -0.091700 0.273150 1.000000 -0.039460 -0.256720
Singlequote 0.063220 -0.315040 -0.274020 0.239260 -0.039460 1.000000 0.444410
Arbeitslosenquote2 -0.295450 -0.662740 -0.233030 -0.102600 -0.256720 0.444410 1.000000

Within the examined df_analyse_gemeinde, the strongest positive correlation between Arbeitslosenquote and the predictor variables belongs to Singlequote (r = +0.444). The strongest negative correlation with Arbeitslosenquote belongs to Christenquote (r = -0.663). The weakest correlation belongs to Akademikerquote (r = -0.103).

In [ ]:
alt.Chart(source, width=200, height=150).mark_circle(size=60).encode(
    alt.X(
        alt.repeat("repeat"), 
        type="quantitative",
        scale=alt.Scale(zero=False)),
    alt.Y('Arbeitslosenquote2'),
    tooltip = ['Name',alt.Tooltip(alt.repeat("repeat"), type="quantitative"), alt.Y('Arbeitslosenquote2')]
).repeat(
    repeat=predictor,
    columns=4
).interactive()
Out[ ]:

Model Conclusions¶

Linear Regression¶

The simple linear regressions yielded the following statistics:

Statistic Migrationsquote Christenquote Singlequote
R squared 0.027 0.382 0.121
R squared adj. 0.024 0.380 0.118
MSE 4.708 2.991 4.254
RMSE 2.170 1.729 2.063
MAE 1.686 1.296 1.492

Of these three models, the one using Christenquote as predictor performs best. Even so, its R squared indicates only moderate model quality: just 38.2 % of the variability in unemployment is explained.

Multiple Regression¶

R squared is 0.636, indicating moderate model quality: about 63.6 % of the variability in unemployment is explained by the multiple regression. The adjusted R squared of 0.629 (62.9 % explained) is marginally lower, as expected once the number of predictors is penalized.

The mean squared error (1.762), root mean squared error (1.327), and mean absolute error (1.004) are all lower than those of the simple linear regression models. For this reason the multiple regression is preferable to the simple linear regressions.

Lasso Regression¶

R squared is 0.635, again indicating moderate model quality: about 63.5 % of the variability in unemployment is explained by the Lasso regression. The adjusted R squared of 0.628 (62.8 % explained) is marginally lower than the plain R squared.

The mean squared error (1.767), root mean squared error (1.329), and mean absolute error (1.004) are almost identical to those of the multiple regression (MSE and RMSE are in fact marginally higher). The Lasso and multiple regression therefore hardly differ on the test set.